Abstract 



We describe an efficient implementation of a hierarchy of algorithms for multiplication of 
dense matrices over the field with two elements (F2). In particular we present our implemen- 
tation - in the M4RI library - of Strassen-Winograd matrix multiplication and the "Method 
of the Four Russians" multiplication (M4RM) and compare it against other available imple- 
mentations. Good performance is demonstrated on on AMD's Opteron and particulary good 
performance on Intel's Core 2 Duo. The open-source M4RI library is available stand-alone as 
well as part of the Sage mathematics software. 

In machine terms, addition in F2 is logical-XOR, and multiplication is logical- AND, thus 
a machine word of 64-bits allows one to operate on 64 elements of F2 in parallel; at most 
one CPU cycle for 64 parallel additions or multiplications. As such, element-wise operations 
over F2 are relatively cheap. In fact, in this paper, we conclude that the actual bottlenecks 
are memory reads and writes and issues of data locality. We present our empirical findings in 
relation to minimizing these and give an analysis thereof. 
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1 Introduction 

We describe an efficient implementation of a hierarchy of algorithms for multiplication of dense 
matrices over the field with two elements (F2). Matrix-matrix multiplication is an important 
primitive in computational linear algebra and as such the fundamental algorithms wc implement 
have been well-known for some time. Therefore this paper focuses on the numerous techniques 
employed for the special case of F2 in the M4RI library (http://m4ri.sagemath.org) and the 
benefits so derived. 

We note that even for problems that do not reduce to matrix-matrix multiplication many of 
the techniques presented in this paper are still applicable. For instance, Gaussian Elimination can 
be achieved via the "Method of the Four Russians" Inversion (M4RI)(cf. [5, Ch. 5] and [3]) and 
borrows ideas from the "Method of the Four Russians" Multiplication (M4RM) [2] , [1] which we 
present here. 

The M4RI library implements dense linear algebra over F2 and is used by Sage [16] and 
PolyBoRi [7]. 

Our optimization efforts focus on 64 bit x86 architectures (x86_64), specifically the Intel Core 

2 Duo and the AMD Opteron. Thus, we assume in this paper that each native CPU word has 
64-bits: Wg = 64. However it should be noted that our code also runs on 32-bit CPUs and on 
non-x86 CPUs such as the PowerPC. 

In machine terms, addition in F2 is logical-XOR, and multiplication is logical-AND, thus a 
machine word of 64-bits allows one to operate on 64 elements of F2 in parallel: at most one CPU 
cycle for 64 parallel additions or multiplications. As such, element-wise operations over F2 are 
relatively cheap. In fact, in this paper, we conclude that the actual bottlenecks are memory reads 
and writes and issues of data locality. We present our empirical findings in relation to minimizing 
these and give an analysis thereof. 

The second author proposed, in [4] and [5, Ch. 5], to count memory accesses rather than 
arithmetic operations to estimate the complexity of such algorithms and the empirical results of 
this paper lend further support to this model. However, this model is a simplification as memory 
access is not uniform, i.e. an algorithm which randomly accesses memory will perform much worse 
than an algorithm with better spatial and temporal locality. While these differences only affect 
the constant of a complexity estimation, in practice they make a very significant difference, as our 
results will demonstrate. 

The paper is structured as follows. Wc proceed from basic arithmetic (Section 2) via the 
classical cubic multiplication algorithm (Section 2.3), through a detailed discussion of the "Method 
of the Four Russians" (Section 3) to the Strassen-Winograd algorithm (Section 4). We start by 
introducing our basic data structures and conclude by presenting timing experiments to show the 



2 



validity of our approach (Section 6). Note, that all timings in this paper time Strassen-Winograd 
multiplication (cf. Section 4) but with different base cases. 

2 Basic Arithmetic 

2.1 Our Matrix Data Structure 

We use a "flat row-major representation" for our matrices. Thus 64 consecutive entries in one row 
are packed into one machine word. Consequently, bulk operations on whole rows are considerably 
cheaper than on whole columns and addressing a single column is more expensive than addressing 
a single row. Additionally, we maintain an array ~ called rowswap - containing the address in 
memory of the first word for each row in the matrix. To represent in-place submatrices (i.e. 
without copying out the data) we also use this rowswap array. We call these in-place submatrices 
"matrix windows" and they consist of addresses of the first word of each row and the number of 
columns each row contains. This approach is limited to "matrix windows" which start and end at 
full word borders but this is sufficient for our application. The advantages and disadvantages of 
the "flat row-major" data structure are, for instance, analyzed in [14]. 

2.2 Row Additions 

Since this basic operation - addition of two rows - is at the heart of every algorithm in this 
paper we should briefly mention the SSE2 instruction set [9] which is available on modern x86_64 
architectures. This instruction set offers an XOR operation for 128-bit wide registers, allowing one 
to handle two 64-bit machine words in one instruction. The use of these instructions docs provide 
a considerable speed improvement on Intel CPUs. Table 1 shows that up to a 25% improvement 
is possible when enabling SSE2 instructions. However, in our experiments performance declined 
on Opteron CPUs when using SSE2 instructions. The authors were unable to identify a cause of 
this phenomenon. 



Matrix Dimensions 


Using 64-bit 


Using 128-bit (SSE2) 


10,000 X 10,000 


1.981 


1.504 


16,384 X 16,384 


7,906 


6.074 


20,000 X 20,000 


14.076 


10.721 


32,000 X 32,000 


56.931 


43.197 



Table 1: Strassen-Winograd multiplication on 64-bit Linux, 2.33Ghz Core 2 Duo 



2.3 Cubic Multiplication 

The simplest multiplication operation involving matrices is a matrix-vector product which can 
easily be extended to classical cubic matrix-matrix multiplication. To compute the matrix-vector 
product Ab we have to compute the dot product of each row i oi A and the vector b. If the 
vector b is stored as a row rather than a column, this calculation becomes equivalent to word-wise 
logical- AND and accumulation of the result in a word p via logical-XOR. Finally, the parity of p 
needs to be computed. However, as there is no native parity instruction in the x86_64 instruction 
set this last step is quite expensive compared to the rest of the routine. To account for this, 64 
parity bits can be computed in parallel [18, Ch. 5]. To extend this matrix- vector multiplication 
to matrix-matrix multiplication B must be stored transposed. 
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3 The Method of the Four Russians 



The "Method of the Four Russians" matrix multipUcation algorithm can be derived from the 
original algorithm published by Arlazarov, Dinic, Kronrod, and Faradzev [2], but does not directly 
appear there. It has appeared in books including [1, Ch. 6]. 

Consider a product of two matrices C = AB where A is an to x Z matrix and B is an Z x n 
matrix, yielding an m x n for C. A can be divided into l/k vertical "stripes" Aq . . . A(j__iyk of 
k columns each, and B into l/k horizontal stripes Bq . . . of k rows each. (For simplicity 

assume k divides The product of two stripes. AiBi requires an m x l/k by l/k x n matrix 
multiplication, and yields an to x n matrix C;. The sum of all k of these Ci equals C. 

(i-i)/k 
C = AB= A,B^. 



Example: Consider k = 1 and 

\ a2 as J \ 02 03 J 

Then 




and consequently 

AoBo = ( ) and A,B, = ( ) . 

\ a20o 0201 J \ 0362 0363 J 

Finally, we have 

C^AB^ AoBo + A,B, = ( ""f" + ""^ + "^J^ ) . 

y 0260 + 0302 0201 + 0363 J 

The principal benefit of multiplying in narrow stripes is that the bits across each row of a stripe 
of A determine which linear combination of rows of B will contribute to the product, e.g. in the 
above example oq, . . . , 03 dictate which linear combination of bo, 62 and hi, 63 must be written to 
the rows of C. However, if the stripe is relatively narrow as in this example, there is only a small 
number of binary values each row of the stripe can take, and thus only a small number of possible 
linear combinations of the rows of B that will be "selected" . If we precompute all possible linear 
combinations of rows of B that could be selected we can create a lookup table into which the rows 
of the stripes of A can index. 

Returning to our example, if oq = 02 and oi = 03 then the same linear combination would 
be written to the first and the second row of C. Precomputation of all 2^ — 1 non-zero linear 
combinations, {1 ■ bo + ■ bi, ■ bo + 1 ■ bi, 1 ■ bo + 1 ■ bi) , ensures that the repeated linear combination 
has only been computed once. In our trivial example this is not a saving, but for much larger 
matrices reuse of the precomputed combinations gives a saving. Precomputing a table in this 
fashion is also called "greasing" . 

The technique just described gives rise to Algorithm 1. In Algorithm 1 the subroutine ReadBits (A, 
r, sc, k) reads k bits from row r starting at column sc and returns the bit string interpreted 
as an integer and AddRowFromTable (C , r, T, x) adds the row x from T to the row j of C. The 
subroutine MakeTable(B, r, c, k) in Algorithm 1 constructs a table T of all 2^ — 1 non-zero 
linear combinations of the rows of B starting in row r and column c. For this calculation Gray 
codes are used. 
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Algorithm 1 M4RM 



function AddRowFromTablc(C, ri, T, r2) begin 
for < z < NumbcrOfColumns(C) do begin 

end 

end 



function ReadBits(A, r, c, k) begin 

return Ar,c * 2''"i + Ar^c+i * S''"^ ^ Ar,c+2 * "2.^^^ ^ h A^x+fc-i * 2' 

end 



function McthodFourRussiansMultiplication(A, B, k) do begin 
m ^ Number OfRows (A) 
I <— NumbcrOfColumns(A) 
n <— NumbcrOfColumns(B) 
C ^ GenerateZeroMatrix(m, n) 

for < i < floor (^/fc) do begin 

/ /create table of 2^^ — 1 linear combinations 

T <~ MakeTable(B, i*k, 0, k) 

for < J < m do begin 

/ / read index for table T 
id ^ RcadBits(A, j, k*i, k) 
//add appropriate row from table T 
AddRowFromTable(C, j, T, id) 

end 

end 

return C 

end 
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3.1 Gray Codes 



The Gray code [11], named after Frank Gray and also known as reflected binary code, is a num- 
bering system where two consecutive values differ in only one digit. Examples of Gray codes for 
two, three and four bits are given in Figure 3.1. 



1 1 
11 11 

1 1 
2-bit Gray Code 110 

111 





1 

1 1 

1 

1 1 
1 1 1 
1 1 
1 

3-bit Gray Code 



10 1 
10 
110 
110 1 
1111 
1110 
10 10 
10 11 
10 1 
10 
4-bit Gray Code 



Figure 1: Gray Codes 

Gray code tables for n-bits can be computed efficiently from n — 1-bit Gray code tables by 
prepending each entry of the n — 1-bit Gray code table with 0. Then the order of the entries is 
reversed and a 1 is prepended to each entry. These two half-tables arc then concatenated. These 
tables can then be used to construct all 2*^ — 1 non-zero linear combinations of k rows where each 
new entry in the table costs one row addition as its index differs in exactly one bit from that of 
the preceding row. Thus computing all 2*^ — 1 non-zero linear combinations of k rows can be done 
in 2*^ — 1 row additions, rather than (fc/2 — 1)2'"" — 1 as would be expected if each vector were to 
be tabulated separately. 

From the complexity analysis in [4] it seems one should always choose the parameter k = 
[log2 n] for an n X n matrix. However, in practice this is not the case. First, experimental 
evidence indicates [5] that 0.75 x log2 n seems to be a better choice. Also, for cache efficiency it 
makes sense to split the input matrices into blocks such that these blocks fit into L2 cache (see 
below) . If that technique is employed then the block sizes dictate k and not the total dimensions 
of the input matrices. Thus, a much smaller k than log2 n is found to be optimal, in practice (see 
below); restraining k in this way actually improves performance. 

We pre-computc the Gray Code tables up to size 16. For matrices of dimension > 20 million 
rows and columns, this is not enough. But, such a dense matrix would have nearly half a quadrillion 
entries, and this is currently beyond the capabilities of existing computational hardware. Also, 
for these dimensions the Strasscn-Winograd algorithm should be used. 

3.2 A Cache Friendly Version 

Note that the M4RM algorithm creates a table for each stripe of B and then iterates over all rows 
of C and A in the inner loop. If the matrices C and A are bigger than L2 cache then this means 
that for each single row addition a new row needs to be loaded from RAM. This row will evict an 
older row from L2. However, as this row is used only once per iteration of all rows of A and C 
we cannot take advantage of the fact that it is now in L2 cache. Thus if the matrices A and C do 
not fit into L2 cache then the algorithm does not utilize this faster memory. 
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Thus, it is advantageous to re- arrange the algorithm in such a way that it iterates over the 
upper part of A completely with all tables for B before going on to the next part. This gives 
rise to Algorithm 2. a cache friendly version of the M4RM algorithm. For simplicity we assume 
that m, I, n are all multiples of some fixed block size in the presentation of Algorithm 2. This 



Algorithm 2 Cache Friendly M4RM 

function MethodOfFourRussiansCacheFriendlyMultipication(A, B, k) 
m ^ Number OfRows (A) 
£ ^ NumberOfColumns(A) 
n ^ NumberOfColumns(B) 
C ^ GenerateZeroMatrix(m, n) 

for < start < m/BlockSize do begin 
for <= i < £/k do begin 

T ^ MakeTable(B, i*k, 0, k) 
for < s < BlockSizc do begin 
j ^ start * BlockSizc + s 
X <— ReadBits(A, j, k*i, k) 
AddRowPromTable(C, j, T, id) 

end 

end 

end 

return C 

end 



cache-friendly rearrangement is paid for by the repeated regeneration of the table T. However, 
compared to the inner loop, this is a cheap operation and thus is outweighed by the better data 
locality. Table 2 shows that this strategy provides considerable performance improvements. 

3.3 Increasing the Number of Gray Code Tables 

Recall that the actual arithmetic is quite cheap compared to memory reads and writes and that 
the cost of memory accesses greatly depends on where in memory data is located: the LI cache is 
approximately 50 times faster than main memory. It is thus advantageous to try to fill all of LI 
with Gray code tables. For example consider n = 10000, A: = 10 and one Gray code table. In this 
situation we work on 10 bits at a time. If we use k = 9 and two Gray code tables, we still use the 
same memory for the tables but can deal with 18 bits at once. The price we pay is one additional 
row addition, which is cheap if the operands are all in cache. To implement this enhancement the 
algorithm remains almost unchanged, except that t tables are generated for tk consecutive rows 
o{ B, tk values x are read for consecutive entries in A and t rows from t different tables are added 
to the target row of C. This gives rise to Algorithm 3 where we assume that tk divides / and fix 
t = 2. 

Table 2 shows that increasing the number of tables is advantageous. Our implementation uses 
eight Gray code tables, which appears to be a good default value according to our experiments. 





"base cases" (cf. Section 5) 


Matrix Dimensions 


Algorithm 1 


Algorithm 2 


Algorithm 3, t = 2 


Algorithm 3, t = 8 


10, 000 X 10, 000 


4.141 


2.866 


1.982 


1.599 


16,384 X 16,384 


16.434 


12.214 


7.258 


6.034 


20, 000 X 20, 000 


29.520 


20.497 


14.655 


11.655 


32, 000 X 32, 000 


86.153 


82.446 


49.768 


44.999 



Table 2: Strassen-Winograd with different base cases on 64-bit Linux, 2.33Ghz Core 2 Duo 
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Algorithm 3 M4RM with Two Gray Code Tables 



function AddTwoRowsFromTable(C, ro, T, ri, TT, r2) do begin 
for <= i < NumbcrOfColuinns(C) do begin 

Cr^i ^ Cr,i -^ri/i TT'r2^i 

end 

end 

function MethodOfFourRussiansTwoTables(A, B, k) do begin 
m ^ Number OfRows (A) 
£ ^ NumberOfColumns(A) 
n <— NumberOfColumns(B) 
C <— GenerateZeroMatrix(m, n) 

for < i < e/{2 * k) do begin 

T ^ MakeTable(B, 2*i*k, 0, k) 

TT ^ MakeTable(B, 2*i*k + k, 0, k) 

for < J < m do begin 

ri ^ ReadBits(A, j, 2*k*i, k) 
r2 ^ ReadBits(A, j, 2*k*i+k, k) 
AddTwoRowsFromTable(C, j, T, n, TT, r2) 

end 

end 

return C 

end 



4 Strassen-Winograd Multiplication 

In 1969 Volker Strassen [17] published an algorithm which multiplies two block matrices 

/ Aoo Aoi\ f Boo Boi \ 
\ Aio All ) V / 

with only seven submatrix multiplications and 18 submatrix additions rather than eight multipli- 
cations and eight additions. As matrix multiplication (0(n"), w > 2) is considered more expensive 
than matrix addition (©(n^)) this is an improvement. Later the algorithm was improved by Wino- 
grad to use 15 submatrix additions only, the result is commonly referred to as Strassen-Winograd 
multiplication. While both algorithms are to a degree less numerically stable than classical cubic 
multiplication over floating point numbers [12, Ch. 26.3.2] this problem does not affect matrices 
over finite fields and thus the improved complexity of 0{Tn}°^'^'^) [17, 5] is applicable here. 

Let m, I and n be powers of two. Let A and B be two matrices of dimension m x I and I x n 
and \ei C = A y. B. Consider the block decomposition 

/ Coo Cqi \ f Aoo Aoi \ I Boo Boi \ 
\ Cio Cii J ^ \ Aio All J \ Bio Bii ) 

where Aoo and Boo have dimensions m/2xl/2 and //2 x n/2 respectively. The Strassen-Winograd 
algorithm, which computes the m x n matrix C ^ A x B, is given in Algorithm 4. 

The subroutine Augment in Algorithm 4 takes two m x I and m x n matrices A and B and 
returns the m x (n + I) matrix C ~ {A B) and the subroutine Stack takes two m x n and / x n 
matrices A and B and returns the (m + 1) x n matrix 




At each recursion step the matrix dimensions must be divisible by two which explains the 
requirement of them being powers of two. However, in practice the recursion stops at a given 
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Algorithm 4 Strasscn-Winograd 



function StrassenWinograd(A,B) do begin 






m <— NumberOfRows(A) 


in 


recursive multiplications 


£ ^ NumberOfColumns(A) 




— Multiply (^ATvi/, ^Tvw) 


n <— NumbcrOfColumns(B) 


Pi ^ 


- Multiply Mate, i?sw) 


Anw ^ SubMatrix(Ao,o • ■ • ^m/2-i,^/2-i) 


P2 ^ 


- Multiply(S'3,Bs£') 


Ane <— SubMatrix(Ao^i/2 ■ ■ • Am/2-i.e-i) 


P3 ' 


- Multiply(As£;,T3) 


Asw ^ SubMatrix(ylm/2,o • ■ • ^Tn-i//2-i) 


Pi ^ 


- Multiply(S'o,To) 


AsE ^ SubMatrix(yl„j/2,^/2 ■ • ■ Am-i,i-i) 


P5 ^ 


- Multiply (S'i,Ti) 




Pe ^ 


- Multiply S'2,T2 


Bnw <— SubMatrix(i3o,o • ■ • -Sf/2-i,„/2-i) 






Bne ^ SubMatrix(i?o „/2 ■ • ■ -B£/2-i,n-i) 


in 


final additions 


Bsw SubMatrix(i3£/2,o • • ■ 5£-i,n/2-i) 


Uo< 


-P^^Pi 


BsE <— SubMatrixfiJii/') „/'>... i?/-! „-i ) 


Ui < 


-P0 + P5 




U2 < 


-Ui+Pe 


//8 additions 




-U1+P4 






-U3 + P2 


'S'l ^ 5*0 — A^w 


c/5 < 


-U2-P3 


S2 ^ Anw — Asw 




-U2+Pi 


S3 <— ^ATB — Si 






To <— Bne — Bnw 


Cn 


<— Augment ( [/q, ^^4) 


Ti ^ BsE — To 


Cs 


^ Augment(C/5, J/g) 


T2 <— BsE — Bne 


C Stack(CAr, Cs) 


T3 <— Ti — Bsw 


return C 




end 





cutoff dimension (cq) and switches over to another multiplication algorithm. In our case, this 
is the M4RM algorithm. Thus the requirement can be relaxed to the requirement that for each 
recursion step the matrix dimensions must be divisible by two. 

However, this still is not general enough. Additionally, in case of F2 the optimal case is when 
m,n,l are 64 times powers of 2 to avoid cutting within words. To deal with odd-dimensional 
matrices two strategies are known in the literature [13]: One can either increase the matrix 
dimensions - this is called "padding" - to the next "good" value and fill the additional entries 
with zeros, yielding A"*" and . Then one can compute C+ = and finally cut out the 

actual product matrix C from the bigger matrix C"*" . A variant of this approach is to only virtually 
append rows and columns, i.e. we pretend they are present. Another approach is to consider the 
largest submatrices A~ and B~ of A and B so that the dimensions of A~ and B~ match our 
requirements - this is called "peeling". Then once the product ~ A~ B~ is computed, one 
resolves the remaining rows and columns of C from the remaining rows and columns of A and B 
that are not in A~ and B~ (cf. [13]). For those remaining pieces Strassen-Winograd is not used 
but an implementation which does not cut the matrices into submatrices. We use the "peeling" 
strategy in our implementation, but note that it is easy to construct a case where our strategy 
is clearly not optimal, Table 3 gives an example where "padding" would only add one row and 
one column, while "peeling" has to remove many rows and columns. This is an area for future 
improvement. 



Matrix Dimensions 


Time in s 


2^4 _ 1 X 2^4 _ 1 

214 X 214 
214 + 1 X 214 ^ ^ 


7.86 
6.09 
6.11 



Table 3: "Peeling" strategy on 64-bit Linux, 2.33Ghz, Core 2 Duo 



9 



To represent the submatrices in Algorithm 4 we use "matrix windows" as described earher. 
While this has the benefit of negligible required additional storage compared to out-of-place sub- 
matrices, this affects data locality negatively. To restore data locality, we copy out the target 
matrix C when switching from Strassen-Winograd to M4RM. On the other hand our experiments 
show that copying out A and B at this crossover point does not improve performance. Data 
locality for B is achieved through the Gray code tables and it appears that the read of x from A 
(cf. Algorithm 1) does not significantly contribute to the runtime. 

However, even with "matrix windows" Strassen-Winograd requires more memory than classical 
cubic multiplication. Additional storage is required to store intermediate results. The most 
memory-efficient scheduler (cf. [8]) uses two additional temporary submatrices and is utilized in 
our implementation. We also tried the "proximity schedule" used in FFLAS [14] but did not see 
any improved performance. 

5 Tuning Parameters 

Our final implementation calls Strassen-Winograd, which switches over to M4RM if the input 
matrix dimensions are less than a certain parameter Cq. If i? then has fewer columns than Ws 
(word size in bits) the classical cubic algorithm is called. This last case is quite common in the 
fix-up step of "peeling". This strategy gives three parameters for tuning. The first is Co, the 
crossover point where we switch from Strassen-Winograd to M4RM. Second, bg is the size for 
block decomposition inside M4RM for cache friendliness. Third, k dictates the size of the used 
Gray code tables. We always fix the number of Gray code tables to t = 8. 

By default Cs is chosen such that two matrices fit into L2 cache, because this provides the best 
performance in our experiments. For the Opteron (1MB of L2 cache) this results in Cs = 2048 and 
for the Core 2 Duo (4MB of L2 cache) this results in Cs = 4096. We only fit two matrices, rather 
than all three matrices in L2 cache as hg reduces the size of the matrices we are working with to 
actually fit three matrices in L2 cache. The default value is fixed at hg = Cs/2. The value k is set 
to [0.75 X log2 hs\ — 2. We subtract 2 as a means to compensate for the use of 8 Gray code tables. 
However, if additionally reducing fc by 1 would result in fitting all Gray code tables in LI cache, 
we do that. Thus, k is either [0.75 x log2 &sj — 2 or [0.75 x log2 &sj — 3 depending on the input 
dimensions and the size of the LI cache. These values have been determined empirically and seem 
to provide the best compromise across platforms. 

On the Opteron these values — Cs = 2048, hg = 1024, fc = 5, i = 8 Gray code tables — mean 
that the two input matrices fit into the 1MB of L2 cache, while the 8 Gray code tables fit exactly 
into LI: 8 • 2^ • 2048/8 = 64Kb. The influence of the parameter hg in the final implementation is 
shown in Table 4 for fixed k = 5 and Cg = 2048. 

On the Core 2 Duo these values are Cg = 4096, bg = 2048, k ^ 6, t = 8 and ensure that all data 
fits into L2 cache. Since the Core 2 Duo has only 32kb of LI cache we do not try to fit all tables 
into it. So far in our experiments, performance did not increase when we tried to optimize for LI 
cache. 



Matrix Dimensions 


bs = 2048 


bs = 1024 


bs = 768 


10, 000 X 10, 000 


2.96 


2.49 


2.57 


16,384 X 16,384 


13.23 


10.49 


10.37 


20, 000 X 20, 000 


21.19 


17.73 


18.11 


32, 000 X 32, 000 


67.64 


67.84 


69.14 



Table 4: Strassen-Winograd multiplication, 64-bit Linux, 2.6Ghz Opteron 
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6 Results 



To evaluate the performance of our implementation we provide benchmark comparisons against 
the best known implementations we are aware of. First, Magma [6] is widely known for its 
high performance implementations of many algorithms. Second, GAP [10] (or equivalently the C- 
MeatAxe [15]) is to our knowledge the best available open-source implementation of dense matrix 
multiplication over F2. Note, that the high-performance FFLAS [14] library does not feature 
a dedicated implementation for F2. In the Tables 5 and 6 we give the average of ten observed 
runtimes and RAM usage for multiplying two random square matrices. The timings for M4RI 
were obtained using Sage [16]. M4RI was compiled with GCC 4.3.1 on both machines and we used 
the options -02 on the Opteron machine and -02 -msse2 on the Core 2 Duo machine. 





Magma 2.14-14 


GAP 4.4.10 


M4RI-20080821 


Matrix Dimensions 


Time 


Memory 


Time 


Memory 


Time 


Memory 


10, 000 X 10, 000 
16,384 X 16,384 
20, 000 X 20, 000 
32, 000 X 32, 000 


2.210 s 
8.670 s 
16.030 s 
58.730 s 


85 MB 
219 MB 
331 MB 
850 MB 


6.130 s 
25.048 s 


60 MB 
156 MB 


1.504 s 
6.074 s 
10.721 s 
43.197 s 


60 MB 
156 MB 
232 MB 
589 MB 



Table 5: 64-bit Debian/GNU Linux, 2.33Ghz Core 2 Duo 





Magma 2.14-13 


GAP 4.4.10 


M4RI-20080811 


Matrix Dimensions 


Time 


Memory 


Time 


Memory 


Time 


Memory 


10, 000 X 10, 000 
16, 384 X 16, 384 
20, 000 X 20, 000 
32, 000 X 32, 000 


2.656 s 
10.260 s 
18.156 s 
67.237 s 


85 MB 
219 MB 
331 MB 
850 MB 


10.472 s 
43.658 s 


60 MB 
156 MB 


2.490 s 
10.490 s 
17.730 s 
67.840 s 


60 MB 
156 MB 
232 MB 
589 MB 



Table 6: 64-bit Debian/GNU Linux, 2.6Ghz Opteron 





Magma 2.14-16 


M4RI-20080909 


Matrix Dimensions 


Time 


Memory 


Time 


Memory 


10,000 X 10,000 
16,384 X 16,384 
20, 000 X 20, 000 
32, 000 X 32, 000 


7.941 s 
31.046 s 
55.654 s 
209.483 s 


85 MB 
219 MB 
331 MB 
850 MB 


4.200 s 
16.430 s 
28.830 s 
109.414 s 


60 MB 
156 MB 
232 MB 
589 MB 



Table 7: 64-bit RHEL 5, 1.6GHz Itanimir 
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